PURPOSE: The purpose of this document is to describe the incomplete retention data of principal investigators originally available to the Office of Sponsored Projects. This is a truncated report and demonstrates a stage in progress. It is useful as a milestone or reference but should not be used to draw conclusions.
OBJECTIVES:
All graphics produced in this report used incomplete data, and no conclusions should be drawn.
This (incomplete) data shows a decline in the estimated headcount of principal investigators starting in about 2020 due to reduced hires and increased departures. No investigation or explanation is attempted in this report.
It also shows that Cluster 5 “Prolific” PI’s (the workhorses of U research), and the population above the 90th percentile for funds won, are more stable with fewer departure counts than other clusters or percentiles.
Upon re-analyzing with complete data, valid conclusions can be drawn.
The retention data contains records for 4,485 principal investigators, where principal investigators are extracted from the table “osp.d_pi_vw” and are presumably identified according to the definitions and designations found in Rule R7-200B. The definition of a principal investigator and inclusion in this data set is being reviewed.
This data set includes all 2,937 principal investigators who submitted proposals after FY2013 as described in the Grants Exploratory project, and contains an additional 1,548 principal investigators who presumably submitted proposals before 2013. Although additional information from HR is forthcoming, and the designation of “principal investigator” in OSP data is being reviewed, it may be prudent to exclude the additional PI’s from future projects due to incomplete data on their proposal submissions.
The data contains up to three dates per PI: initial hire date, one re-hire date, and the most recent termination date. Because not all termination and re-hire dates are included, it is impossible to accurately tabulate head count. This prevents accurate calculation of metrics that use the headcount as a denominator, such as the turnover rate. Instead, re-hire dates are ignored and headcount is estimated using the initial hire date.
As well, action reasons (such as an explanation for termination) are not included, making it is impossible to distinguish “voluntary” and “involuntary” separation.
The earliest hire date reaches back to 1958, and the earliest termination date is in 1958. The largest interval between initial hire and rehire date is - years, and the largest interval between hire date and termination date is - years.
The data shows most hire or termination activity happening around July 1st.
Population head count appears to decline from 2020 onwards due to an increase in terminations and a decrease in hiring. Because re-hires are ignored, head count is roughly estimated.
Among the colleges, the count of departures increased dramatically in 2024 at Engineering and Science, and remained stable (though elevated) at the School of Medicine and at the Huntsman Cancer Institute.
The description of the clusters and percentiles is located at [NAME DOCUMENT].
Among the clusters, departure counts among PI’s not assigned to a cluster (due to too few proposals) are highest, and lowest in Cluster 5 (NAME THE CLUSTER.)
By percentiles, the 90th percentile or higher population has the lowest departure counts and what appears to be the most stable population. Errors introduced by this estimation aside, it introduces a chicken-or-egg question: is this population successful because they are stable, or stable because they are successful?
[Do I want to say something about the lower percentiles?]
Percentiles are re-calculated on an annual basis. About 40% of the PI’s consistently win funds within a five percentile range of themselves in any given year. However, a very strong alternating pattern was discovered where a PI wins at a high percentile in one year followed by a low percentile in the next year.
[how is that possible?]
The count of hires is fairly stable after 2005 until declining
Describe that this is an investigation into turnover of principal investigators.
Describe how principal investigators are defined; and reference some of the open questions that we’re checking.
Describe some details of the data (earliest hire date, number of PI’s.)
Describe some quirks (longest spans between hire and rehire, hire and termination, rehire and termination.)
At the time of the preparation of this report, the Office of Sponsored Projects had access to incomplete HR data. Although hire, departure, and net delta counts are possible, the head count had to be estimated, preventing accurate calculation of turnover rates. Additionally, due to a lack of action reasons, voluntary and involuntary separation could not be differentiated.
Generally, the data set shows an increase in estimated headcount followed by a decline starting in 2020. The College of Medicine has the largest headcount of PI’s, and the steepest decline. Huntsman Cancer Institute had a modest decline in head count starting in 2024.
The PI’s greater than the 90th percentile in funds requested won showed stability, with equivalent hires and fewer departures than PI’s greater than the 70th or 80th percentile.
I need to change everything to counts, and only interpret that.
Population definition – who got in here and why?
It consisted of a an intial hire date and re-hire or final termination dates as applicable. Because there was no initial termination date paired with the re-hire date, the re-hire date was ignored. This means that the headcount was calculated only using the initial hire date. Because the head count was wrong, metrics like turnover rate that rely on the head count are wrong.
VPR.D_PI_EMP_DT_VW as SELECT pi.”PI_DIM_KEY”,
pi.”PI_EMPLID”,
pi.”PI_FIRST_NAME”,
pi.”PI_MIDDLE_NAME”,
pi.”PI_LAST_NAME”,
pi.”PI_NAME”,
pi.”PI_EMAIL_ADDRESS”,
pi.”PI_PHONE”,
pi.”PI_INDICATOR”,
pi.”IS_PI”,
pi.”PI_LOAD_DATE_TIME”,
pi.”PI_UPDATE_DATE_TIME”,
emp.hire_dt,
emp.rehire_dt,
emp.termination_dt FROM osp.d_pi_vw pi LEFT JOIN
uuetl_hr.PS_UU_EMPLOYMENT_VW emp ON pi.pi_emplid = emp.emplid
| Name | retData |
| Number of rows | 4485 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 1 |
| POSIXct | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| PI_EMPLID | 0 | 1 | 1 | 9 | 0 | 4483 | 1 |
| PI_FIRST_NAME | 0 | 1 | 1 | 29 | 0 | 2125 | 0 |
| PI_MIDDLE_NAME | 0 | 1 | 1 | 20 | 0 | 994 | 1552 |
| PI_LAST_NAME | 0 | 1 | 2 | 22 | 0 | 3579 | 0 |
| PI_NAME | 0 | 1 | 5 | 50 | 0 | 4481 | 0 |
| PI_EMAIL_ADDRESS | 0 | 1 | 7 | 39 | 0 | 4482 | 0 |
| PI_PHONE | 0 | 1 | 5 | 16 | 0 | 1744 | 0 |
| PI_INDICATOR | 0 | 1 | 7 | 22 | 0 | 3 | 0 |
| IS_PI | 0 | 1 | 3 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| PI_DIM_KEY | 0 | 1 | 2860.88 | 2356.4 | -2 | 1135 | 2286 | 3429 | 7935 | ▇▇▃▁▃ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| PI_LOAD_DATE_TIME | 0 | 1.00 | 2020-05-12 16:02:38 | 2025-06-03 05:01:16 | 2022-04-07 14:09:46 | 452 |
| PI_UPDATE_DATE_TIME | 303 | 0.93 | 2023-06-16 12:37:17 | 2025-06-05 05:01:02 | 2025-01-31 05:01:10 | 199 |
| HIRE_DT | 4 | 1.00 | 1958-09-01 00:00:00 | 2025-05-01 00:00:00 | 2008-06-24 00:00:00 | 1777 |
| REHIRE_DT | 3228 | 0.28 | 1998-07-20 00:00:00 | 2027-02-15 00:00:00 | 2012-08-01 00:00:00 | 612 |
| TERMINATION_DT | 3013 | 0.33 | 2007-08-01 00:00:00 | 2025-06-01 00:00:00 | 2019-11-17 12:00:00 | 630 |
[FIXED discussion points] - Some aspects of this graph appear to be
mis-aligned due to how counts per interval were aggregated: - Axis
labels
- Points not along the main curve
- This needs to be investigated
[RELEVANT discussion points]
- The general pattern of an increasing count of PI’s that peaked around
the year 2020 and declined since needs to be confirmed
- Because the total money requested won has increased, this would mean
the average per PI has increased. This could be one double-check.
Discussion points:
I’m struggling if there’s a point to do this.
Maybe just to develop the mechanics?
Discussion points:
Discussion points:
By visual inspection, here’s a few things I’m noticing: First of all, the headcount is completely unexpected. At least the smallest bucket – 0 to 0.2 – has roughly double the headcount as the others. And the steepest decline (interesting!)
However, how does the top 90th percentile have the largest headcount? Clearly? Maybe they have the most stable population? I mean, that’s what this graph is showing, isn’t it? So the people who are in this elite group tend to stick around. That’s positive. (But don’t forget this is ‘incomplete’ data with a broad assumption.)
The headcount for the other lines really start to diverge at about 2020.
As far as the decline, could this mean that new PI’s simply haven’t had the time to enter into the 90th percentile? After all, these percentiles are calculated over a ten year time.
Which makes me wonder how much of this is circular reasoning. That is, I have a very large drop-off in PI’s at the bottom. And, are they at the bottom because they left? Are the PI’s at the top because they stayed? If a PI currently at the top left – then wouldn’t their lifetime total drop, and their lifetime percentile drop, and they wouldn’t be at the top?
Is it that they are stable, and that puts them in the top? Or is it that they are at the top, and that makes them stable?
A twin graphic to this is the amount won per year by percentile. That would make a reasonably nice line graph.
This portion is now fixed: I am not understanding in the slightest how I have the largest headcount in the 90th percentile and then going down from there.
This needs some looking into.
Fixed – by using piCluster[[2]] instead of piEmplid, where piCluster[[2]] uses the full population.
so now I am understanding it — I calculated the percentile based on the full population, but then filtered to just PI’s with at least three or four proposals, so my population is heavier at the higher percentiles. using piCluster[[2]] should correct that.
#3) COMBINE RETENTION AND PROPOSAL DATA
## [1] 304
Before I compare turnover by cluster, I need to add this descriptive bit:
all(prepData\(PROPOSAL_PI_EMPLID %in% retData\)PI_EMPLID) # TRUE (good)
table(retData\(PI_EMPLID %in% prepData\)PROPOSAL_PI_EMPLID) FALSE TRUE 1538 2937
And discuss how there are more PI’s in the retData than are in my proposal data set, and that talking about clusters introduces a filter.
Presumably these 1,538 “FALSE” ones haven’t proposed after 2013.
It’s probably a good time to disect that query a little more. …. and here’s what I’m seeing:
a “left join” onto osp.d_pi_vw.
So now I want to know how osp.d_pi_vw was created (how is it defined?) and how did it get 1,538 PI’s that aren’t in my proposal data?
Let’s check that out –
My “prepData” had some cleaning on it.
…and how worth is my time to chase this down when I’m likely to get new information and better data after talking to HR? I think I’ll leave this one down.
It does raise the question. From my complexMetrics, what’s the first hire date?
NOTES FROM REVIEW MEETING:
This report was reviewed on 4.30.2025 with the following action item and summary.
Action Items: - Dave to investigate negative and zero values for rehire-to-termination intervals (after Bill sends a list.) - Dave to investigate: - Obtaining information that could describe voluntary and involuntary separation (who was fired vs who quit) - Obtaining birthdates from HR (so we can see if the declining number of PI’s correlates with old age and retirement.) - Bill to investigate: - Proposal submission dates against the hire/rehire/termination intervals - Number of PI’s submitting each year - Tighten up interval definitions and aggregations - Fix x-axis graph labeling - “Tendrils” on the per-week graph of active PI’s - Compare the mean proposal award and see if it is climbing (to verify the declining number of PI’s since COVID.) - Compare counts of hire/rehire/termination dates on the same graphic - Compare PI’s submitting proposals against this view for completion’s sake
Summary: - Bill asks how the view with the HR data is defined; what
defines a PI for inclusion in this view? - Rehire date is after the
termination date in one case, and on the same day for a few others. Dave
will investigate.
- The lack of an initial termination date to pair with the initial hire
date causes some confusion and guess-work as to when people are actually
actively researching. For example, if someone worked for a semester as a
janitor as an undergrad, and then came back 15 years later as a
researcher, the data will count her in the denominator of the turnover
calculation for those 15 years. This means the turnover calculation is
incorrect. - Bill wonders: Can we identify voluntary vs involuntary
separation? –> Dave thinks there might be a “reason for update” field
with a code attached to it in PS_JOB. - The number of active PI’s has
declined since COVID. Dave thinks this may be age and retirement. Dave
will try and get the birthday so we can include. - The turnover has
increased every year, dramatically increasing in 2024. - This is a
quick’n’dirty report with some problems: - The definition of “active
researchers” has problematic assumptions as described above. - The
headcount of active researchers (between hire and termination date) by
week has “tendrils” that need to be explained. - X-axis labels are
shifted and interval aggregations need to be double-checked. -
Kruskal-Wallis is not a very sensitive test nor the most appropriate for
a time series - A visual inspection of the “per cluster” trends shows
two clusters moving together, one cluster consistently below the others,
and two clusters with volatile and large turn-over.
CONCLUSIONS AND NEXT STEPS:
Several aspects of this report were intriguing:
As well, many aspects of this report deserve a better treatment, including attempting to eliminate the guess-work introduced by the hire/rehire dates.
Due to these reasons, it was decided to continue investigating turnover by principal investigators.
Duration thoughts:
Notes after the initial over-all headcount metrics
I need some words like
[x] number of PI’s have hire and rehire dates. This creates uncertainty in the data as there is only one termination date.
Therefore, only the hire date is used. And the re-hire date is completely ignored.
This means that individuals are counted after their hire date. “Headcount” includes individuals who are after their hire date and before their termination date.
It means I am including someone who might have worked as a janitor during their undergraduate years, quit upon graduation, and have been elsewhere.
It also means that I am including someone who
(room with hallway graphic (?))
Let’s show a graphic with the re-hires removed completely.
I really need to emphasize my uncertainty with this data and graphic. There’s a strong guess here, or this reflects an assumption that is wrong.
Probably a discussion on how my hire and departure rates are calculated on an avg headcount basis.
And, what if this is COMPLETELY wrong? Like, laughably and insanely wrong?
It might be a good time to walk through the SQL logic, too.
I should pull up a specific example of where I think they were a student, and where they were re-hired after their career.